[Kernel] Add NVFP4 MoE CUTLASS support for SM120 #29242
Conversation
Code Review
This pull request adds support for NVFP4 MoE kernels on SM120 architecture using CUTLASS. The changes include adding the new kernel file to the build system, implementing the SM120-specific kernel, and creating a dispatcher to select the correct kernel based on the SM version.
The implementation for the SM120 kernel introduces significant code duplication with the existing SM100 kernel. I've left a comment suggesting a refactoring to improve maintainability by using templates to abstract away the architecture-specific details, similar to patterns seen elsewhere in the codebase. Other changes look good.
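For illustration, a minimal sketch of the SM-version dispatch described above. All names here are hypothetical placeholders, not vLLM's actual entry points or signatures:

```cpp
#include <cuda_runtime.h>
#include <stdexcept>

void run_fp4_moe_sm100();  // placeholder for the SM100 CUTLASS kernel launcher
void run_fp4_moe_sm120();  // placeholder for the SM120 CUTLASS kernel launcher

void dispatch_fp4_moe(int device_id) {
  cudaDeviceProp props{};
  cudaGetDeviceProperties(&props, device_id);
  int sm = props.major * 10 + props.minor;  // 100 on SM100, 120/121 on SM120/SM121
  if (sm >= 120) {
    run_fp4_moe_sm120();
  } else if (sm >= 100) {
    run_fp4_moe_sm100();
  } else {
    throw std::runtime_error("NVFP4 CUTLASS MoE requires SM100 or newer");
  }
}
```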
Debugging some things here locally with trace output enabled: I wonder if we're going to end up needing some of the fixes from CUTLASS 4.3.0, such as https://github.com/NVIDIA/cutlass/pull/2789/files#diff-17feae83c50669a768a86d30b7bd9c50feb773c10c38b85b253f15e7462283d0 or https://github.com/NVIDIA/cutlass/pull/2789/files#diff-b743e12813d0ca6abe40a6711013c612bb15f7affeba65c7d7d4392d7e3d406d
So, after some digging around, I got this seemingly working by adjusting
Wow, I couldn't figure out the gemm initialization failure for several hours yesterday. Even CUTLASS debug logs were giving me nothing. If that fixes the issue for me, thank you so much.
I tried explicit template instantiation and that did not work. I also tried static inline template and non-templated variants. Here's the gist with what seems to work for me: https://gist.github.com/bbrowning/fb0aac4970a881eba26384d262cd41c9. That's entirely untested on anything but a DGX Spark and the model RedHatAI/Qwen3-30B-A3B-NVFP4. Perhaps a macro could reduce the code duplication there.
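As a rough sketch of that macro idea, assuming hypothetical names throughout (the arch tags stand in for CUTLASS's real ones such as `cutlass::arch::Sm100`/`Sm120`, and `MoeArgs`/`launch_grouped_gemm` are illustrative, not the gist's actual API):

```cpp
// Hypothetical arch tags standing in for CUTLASS's real arch types.
struct Sm100Tag {};
struct Sm120Tag {};

struct MoeArgs { /* problem sizes, pointers, strides */ };

// Arch-specific CUTLASS configuration would live behind this template,
// selected via the tag type; the body is defined elsewhere.
template <typename ArchTag>
void launch_grouped_gemm(const MoeArgs& args);

// One macro stamps out a per-arch entry point instead of duplicating the body.
#define DEFINE_FP4_MOE_RUNNER(SUFFIX, ARCH_TAG)      \
  void run_fp4_moe_##SUFFIX(const MoeArgs& args) {   \
    launch_grouped_gemm<ARCH_TAG>(args);             \
  }

DEFINE_FP4_MOE_RUNNER(sm100, Sm100Tag)
DEFINE_FP4_MOE_RUNNER(sm120, Sm120Tag)
```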
I don't know what to really expect performance-wise, but this feels like it's taking advantage of the new NVFP4 kernels? The output token/s is higher than I expected...
Documentation preview: https://vllm--29242.org.readthedocs.build/en/29242/
This pull request has merge conflicts that must be resolved before it can be merged.
I pulled the latest changes here onto my SM121 and was able to load RedHatAI/Qwen3-30B-A3B-NVFP4 and RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4 (both NVFP4 MoE models) successfully without having to make any other local changes. Thanks!
```cpp
fusion_args.alpha_ptr_array =
    reinterpret_cast<float**>(alpha_ptrs.data_ptr());
fusion_args.dAlpha = {_0{}, _0{}, 1};
fusion_args.beta = 0.0f;
```
Does the sm100 version need beta set to 0 also?
Doesn't seem to be so; the sm120 kernel is a bit unique, as seen from the bf16 issue.
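For context, this is the standard GEMM epilogue semantics: the output is computed as

$$D = \alpha \cdot (A B) + \beta \cdot C,$$

so setting `beta = 0.0f` drops the source-tensor term `C` entirely; per the reply above, the SM100 kernel evidently gets by without setting it explicitly.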
bnellnm left a comment:
LGTM. I'm not a cutlass expert but it looks close enough to the sm100 version.
| "Failed to initialize GEMM: status=", (int)status, | ||
| " workspace_size=", workspace_size, " num_experts=", num_experts, | ||
| " M=", M, " N=", N, " K=", K); |
nit: Would it be useful to add a similar detailed message to the SM100 kernel?
I want to improve this across the board for our cutlass kernels, so I will do that in a follow-up.
lgtm also!
Purpose
Expand `csrc/quantization/fp4/nvfp4_experts_quant.cu` and `csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu` to build for SM120 (RTX 50xx and RTX PRO 6000 Blackwell) so we can support NVFP4 CUTLASS MoE on the platform. Thanks to @AichenF in sgl-project/sglang#11737 for the extension to the SM100 kernel.
Hoping to address #29030, #29141
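As a sketch of the compile-time side of this (the exact guards vLLM uses may differ), an SM120-only device code path is typically fenced like so:

```cpp
// __CUDA_ARCH__ is 1200 when compiling for compute capability 12.0 (SM120),
// so this body is emitted only for SM120+ targets.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 1200
// ... SM120-specific CUTLASS grouped-GEMM body ...
#endif
```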
Test Plan
Eval on RTX 5090
Test Result
Main:
Fails to run because various kernels are not built for SM120, even though the CUTLASS path is taken.
PR:
Test compressed-tensors checkpoint https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4
Test modelopt checkpoint https://huggingface.co/nvidia/Qwen3-30B-A3B-NVFP4